Journal of Chemical Information and Modeling

American Chemical Society (ACS)

Preprints posted in the last 90 days, ranked by how well they match Journal of Chemical Information and Modeling's content profile, based on 207 papers previously published here. The average preprint has a 0.21% match score for this journal, so anything above that is already an above-average fit.

1
Learning fragment-based segmentation of binding sites from molecular dynamics: a proof-of-concept on cardiac myosin.

Yang, Y.-Y.; Pickersgill, R. W.; Fornili, A.

2026-02-16 bioinformatics 10.64898/2026.02.13.703009 medRxiv
Top 0.1%
69.6%

The geometric and chemical features of protein binding sites tend to change as a consequence of conformational dynamics. In the ligand-unbound (apo) state, a binding site might be only transiently organised in a way that can accommodate a given ligand, with the relevant regions of the protein coming together in a suitable arrangement only in a subset of conformations. Ligand binding itself can also induce further changes in the binding site. Because most ligands can be decomposed into smaller fragments, we hypothesised that mapping onto the binding site surface the propensity of binding specific fragments could be used to monitor changes in the overall ability of the site to bind a ligand. This task can be formulated as semantic segmentation, which can now be performed successfully using deep learning methods. Here we introduce the Fragment-Based protein Ensemble semantic Segmentation Tool for Myosin (FragBEST-Myo), a deep learning method based on a 3D U-Net architecture, trained to partition the omecamtiv mecarbil (OM) binding site of cardiac myosin into fragment-specific regions using only local shape and physico-chemical features. The model was trained on labelled Molecular Dynamics (MD) trajectories of OM-bound myosin in both post-rigor (PR) and pre-power-stroke (PPS) states, achieving an accuracy of ~95% and a mean Intersection over Union (mIoU) > 0.75 on unseen trajectories from both states. When applied to apo trajectories, FragBEST-Myo-derived descriptors produced rankings consistent with similarity to holo conformations. Moreover, selecting apo frames based on FragBEST-Myo ranking increased the chance of recovering holo-like OM docking poses relative to randomly chosen control frames, supporting its use as a screening tool for ensemble docking. Beyond frame selection, fragment maps provide a compact representation to assess docking poses and to guide fragment-based design. 
Our proof-of-concept provides a basis for developing future general models applicable to a broader range of proteins and ligands, with the fragment-based formulation offering a natural route to generalisation.
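
The accuracy and mean Intersection over Union (mIoU) figures quoted in this abstract are standard segmentation metrics. The following is a minimal sketch of how they are commonly computed, not the FragBEST-Myo evaluation code; the flat label lists, the class numbering and the function name `segmentation_metrics` are assumptions made for illustration.

```python
from typing import Sequence, Tuple


def segmentation_metrics(y_true: Sequence[int], y_pred: Sequence[int],
                         num_classes: int) -> Tuple[float, float]:
    """Return (accuracy, mean IoU) for two flat label sequences.

    Classes that appear in neither sequence are skipped when averaging.
    """
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and equally long")
    pairs = list(zip(y_true, y_pred))
    correct = sum(t == p for t, p in pairs)
    ious = []
    for c in range(num_classes):
        inter = sum(t == c and p == c for t, p in pairs)
        union = sum(t == c or p == c for t, p in pairs)
        if union:
            ious.append(inter / union)
    mean_iou = sum(ious) / len(ious) if ious else 0.0
    return correct / len(pairs), mean_iou


# Toy labels: 0 = background, 1 and 2 = two hypothetical fragment classes.
truth = [0, 0, 1, 1, 2, 2]
pred = [0, 1, 1, 1, 2, 0]
acc, miou = segmentation_metrics(truth, pred, num_classes=3)
```

In practice the labels would be per-voxel or per-surface-point class assignments; averaging IoU over classes keeps a dominant background class from masking poor performance on small fragment regions.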

2
Evolutionary exploration of drug-like chemical space utilizing generative AI and virtual screening

Secker, C.; Secker, P.; Yergoez, F.; Celik, M. O.; Chewle, S.; Phuong Nga Le, M.; Masoud, M.; Christgau, S.; Weber, M.; Gorgulla, C.; Nigam, A.; Pollice, R.; Schuette, C.; Fackeldey, K.

2026-03-30 bioinformatics 10.64898/2026.03.26.714527 medRxiv
Top 0.1%
62.8%

The identification of suitable lead molecules in the vast chemical space is a critical and challenging task in drug discovery campaigns. Recently, it has been demonstrated that large-scale virtual screening provides a powerful approach to accelerate the identification of novel drug candidates by screening ever-increasing virtual ligand libraries, which have reached magnitudes of > 10^20 compounds. However, this desirable increase in potentially bioactive molecules poses a new challenge as enumerating and virtually screening such huge compound libraries is computationally prohibitive. Consequently, advanced approaches to navigate ultra-large chemical spaces and to identify suitable candidate molecules therein are urgently needed. Here, we present an evolutionary algorithm framework using molecular generative AI, reaction-based substructure searching, and iterative model fine-tuning for a targeted and efficient exploration of chemical fragment spaces. Combining this approach with large-scale virtual screening, we are able to identify target-specific candidate molecules within the commercially available Enamine REAL Space (~10^15). We demonstrate the applicability of the approach by successfully identifying and biochemically validating pH-specific ligands of the μ-opioid receptor. Our results demonstrate that integrating generative AI with evolutionary algorithms provides a promising route to explore ultra-large chemical spaces for the discovery of novel, synthetically accessible lead molecules.

3
G-screen: Scalable Receptor-Aware Virtual Screening through Flexible Ligand Alignment

Jung, N.; Park, H.; Yang, J.; Seok, C.

2026-03-05 biophysics 10.64898/2026.03.03.707320 medRxiv
Top 0.1%
62.7%

Virtual screening has long been a central computational tool for rational ligand discovery, enabling the systematic prioritization of candidate molecules from large chemical libraries. Although docking and related approaches that explicitly account for receptor-ligand interactions have been developed and refined over several decades, achieving both reliable receptor-aware interaction modeling and computational scalability remains an open challenge, particularly for ultra-large chemical spaces. Ligand-based methods are fast and robust but do not explicitly incorporate receptor structure, whereas docking-based approaches model receptor-ligand interactions more directly at substantially higher computational cost. Here, we present G-screen, a freely available and scalable receptor-aware virtual screening framework designed for cases in which a reference protein-ligand complex structure is available. Instead of performing full docking, G-screen rapidly aligns candidate ligands to the reference ligand using a flexible global alignment algorithm (G-align) and evaluates receptor-aware pharmacophore interactions derived from the reference complex, thereby combining the efficiency of ligand-based alignment with explicit atomic-level interaction analysis. Benchmarking on DUD-E, LIT-PCBA, and MUV datasets demonstrates that G-screen achieves competitive discrimination and early enrichment relative to representative ligand-based and docking-based methods, while maintaining millisecond-scale per-molecule runtimes under multi-threaded execution. These results position G-screen as a practical and scalable receptor-aware screening strategy for efficiently filtering large chemical libraries when a reference complex structure is available. Scientific Contribution: We have developed a scalable virtual screening framework for efficiently filtering ultra-large chemical libraries using a flexible global alignment algorithm combined with receptor-aware pharmacophore evaluations.
Despite explicitly capturing atomic-level interactions, the screening process using this method is highly efficient, maintaining millisecond-scale per-molecule runtimes under parallel execution. It achieves competitive discrimination and early enrichment, successfully bridging the speed of ligand-based approaches with the structural context of traditional docking.
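
"Early enrichment" in virtual screening benchmarks is often reported as an enrichment factor (EF): the active rate in the top-ranked fraction divided by the active rate in the whole library. The sketch below shows that common definition only; it is not taken from G-screen, and the function name, the default fraction and the convention that a higher score is better are assumptions.

```python
from typing import Sequence


def enrichment_factor(scores: Sequence[float], is_active: Sequence[bool],
                      fraction: float = 0.01) -> float:
    """EF at `fraction`: active rate in the top slice / active rate overall.

    Assumes higher scores rank first (an assumption of this sketch).
    """
    n = len(scores)
    if n == 0 or n != len(is_active):
        raise ValueError("scores and labels must be non-empty and equally long")
    total_actives = sum(is_active)
    if total_actives == 0:
        raise ValueError("at least one active is required")
    n_top = max(1, int(round(n * fraction)))
    ranked = sorted(zip(scores, is_active), key=lambda pair: pair[0], reverse=True)
    hits_top = sum(active for _, active in ranked[:n_top])
    return (hits_top / n_top) / (total_actives / n)


# Toy library: 10 molecules, 2 actives, one of which is ranked in the top 20%.
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
toy_labels = [True, False, False, True, False, False, False, False, False, False]
ef20 = enrichment_factor(toy_scores, toy_labels, fraction=0.2)
```

An EF of 1 corresponds to random ranking; the maximum attainable value depends on the fraction chosen and on the proportion of actives in the library.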

4
Structure-Based and Stability-Validated Prioritization of BACE1 Inhibitors Integrating Meta-Ensemble QSAR and Molecular Dynamics

Chowdhury, T. D.; Shafoyat, M. U.; Hemel, N. H.; Nizam, D.; Sajib, J. H.; Toha, T. I.; Nyeem, T. A.; Farzana, M.; Haque, S. R.; Hasan, M.; Siddiquee, K. N. e. A.; Mannoor, K.

2026-04-10 bioinformatics 10.64898/2026.04.07.716920 medRxiv
Top 0.1%
55.6%

Alzheimer's disease remains a major therapeutic challenge, and no β-secretase (BACE1) inhibitor has achieved clinical approval. A key limitation of prior discovery efforts is reliance on single-parameter optimization, often resulting in candidates with limited translational potential. In this study, we developed a biology-informed computational framework integrating meta-ensemble QSAR modeling, molecular docking, Protein Language Model (ESM-1b)-guided residue interaction weighting, and ADMET profiling within a normalized multi-parameter ranking scheme. Model performance was validated using cross-validation, external validation, and Y-randomization (n = 100; p = 0.009), while applicability domain analysis based on Tanimoto similarity highlighted reduced reliability for extrapolative predictions. Sensitivity analysis showed high ranking stability under moderate perturbations (Spearman ρ = 0.998 for ±10%; 0.963 for ±25%), with reduced agreement under randomized weighting (ρ = 0.821), indicating that prioritization is robust but influenced by weight selection. Screening of 16,196 compounds identified 153 predicted actives (accuracy = 0.852; ROC-AUC = 0.920), which were refined to 111 candidates and seven prioritized leads. Molecular dynamics simulations (200 ns) indicated stable binding and persistent catalytic interactions, with Mol-2 showing favorable dynamic stability and ADMET characteristics. Overall, this study presents an interpretable and quantitatively evaluated framework for multi-parameter compound prioritization, supporting more reliable virtual screening in early-stage CNS drug discovery.
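
An applicability domain check based on Tanimoto similarity, as mentioned in this abstract, is often implemented as a nearest-neighbour test against the training set. The sketch below illustrates that general idea only; it is not the authors' procedure. Fingerprints are represented as sets of "on" bit indices, and the 0.5 threshold and the function names are hypothetical choices.

```python
from typing import FrozenSet, Iterable


def tanimoto(a: FrozenSet[int], b: FrozenSet[int]) -> float:
    """Tanimoto (Jaccard) similarity between two sets of 'on' fingerprint bits."""
    if not a and not b:
        return 0.0  # convention chosen in this sketch for two empty fingerprints
    return len(a & b) / len(a | b)


def in_domain(query: FrozenSet[int], training: Iterable[FrozenSet[int]],
              threshold: float = 0.5) -> bool:
    """True if the query's nearest training neighbour reaches the threshold.

    The threshold value is a hypothetical example, not a recommended setting.
    """
    best = max((tanimoto(query, t) for t in training), default=0.0)
    return best >= threshold


# Toy fingerprints as sets of bit indices.
training_set = [frozenset({1, 2, 3, 4}), frozenset({2, 3, 5})]
near_query = frozenset({1, 2, 3})   # shares 3 of 4 bits with the first entry
far_query = frozenset({7, 8, 9})    # shares no bits with any entry
```

Predictions for queries such as `far_query`, which have no close training neighbour, are the extrapolative cases for which reduced reliability would be expected.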

5
Circumventing the synthesizability problem in generative molecular design

Weller, J. A.; Li, J.; Jiang, Y.; Rohs, R.

2026-02-19 bioinformatics 10.64898/2026.02.18.706722 medRxiv
Top 0.1%
54.0%

Generative structure-based drug design (SBDD) models have shown great promise to accelerate our ability to discover novel drug candidates. However, these models have been criticized for producing compounds that are not very synthesizable, and therefore not practically applicable to drug design. In this work, we propose a way to circumvent the synthesizability issue by introducing a model-guided virtual screening (MGVS) pipeline which pairs SBDD models with efficient chemical similarity search methods to identify synthesizable analogs of generated compounds in existing ultra-large compound databases. Using this approach, we demonstrate that synthesizable analogs of generated compounds with equivalent or better docking scores and similar predicted binding poses can be reliably identified across a wide range of protein targets. We find that MGVS outperforms standard virtual ligand screening (VLS), consistently yielding at least a 25x improvement in screening efficiency across three different SBDD models. As drug-like chemical spaces continue to grow and standard VLS methods focused on exhaustive screening become increasingly impractical, approaches like MGVS that effectively narrow the search space will become critical for advancing drug discovery.

6
Revealing imatinib-kinase specificity via analyzing changes in protein dynamics and computing molecular binding affinity

Troxel, W.; Vig, E.; Chang, C.-e.

2026-02-04 biochemistry 10.64898/2026.02.02.703340 medRxiv
Top 0.1%
52.9%

Drug promiscuity is a double-edged sword: a small molecule acting on multiple biological targets can produce either toxic or therapeutic effects. It is possible to exploit promiscuity to expand treatment options without the prohibitive costs of designing a new drug. Imatinib is a representative case, exhibiting varied affinities and inhibitions to different kinases. It binds most favorably to Abl and Kit kinases, intermediately to Chk1 and Lck kinases, and least favorably to p38 and Src kinases. The strongly conserved features of the ATP-binding site render imatinib's molecular binding determinants unclear despite over 25 years of interrogation. To address this question, molecular thermodynamics, force distribution analysis, residue sidechain dihedral correlations, and principal component analysis were computed using trajectories from all-atom molecular dynamics simulations in explicit solvent. The results of these simulations agree with experimental affinity and binding data, enabling highly predictive factors for imatinib's binding specificity from free- and bound-state simulations through a global protein network of protein-ligand interactions, changes in sidechain dihedral correlations, and shifts in the secondary motifs modulating binding site access corresponding with well-characterized kinase "breathing motions." The sidechain dihedral correlation network also identifies distal mutants known to reduce patients' imatinib sensitivity. Higher imatinib-kinase affinity trends with a loss in sidechain dihedral correlations and diminished secondary motif migration following binding, corresponding with more restricted configurations, to reduce solvent approach and ATP competition. Lower-affinity proteins show enhanced sidechain dihedral correlation and exaggerated secondary motif motions. This is consistent with a tendency to expose the protein pocket, facilitate solvent entrance, and increase ATP competition.
Using imatinib as a model system, this study shows residue correlation, force interaction, and essential principal components can effectively forecast imatinib-kinase binding specificity and introduces an effective approach to repurpose and design high-affinity binders for off-target applications more generally.

7
MOZAIC: Compound Growth via In Silico Reactions and Global Optimization using Conformational Space Annealing

Yoo, J.; Shin, W.-H.

2026-03-10 bioinformatics 10.64898/2026.03.07.710272 medRxiv
Top 0.1%
52.2%

Motivation: Fragment-based drug discovery (FBDD) is an efficient strategy that leverages small molecular fragments to explore broader chemical space by combining them. Advances in computational methods have enabled the calculation of molecular properties and docking scores, thereby accelerating the development of algorithm- and AI-based approaches in FBDD. However, it should be noted that certain methods do not provide synthetic pathways to obtain the proposed compounds. Consequently, these molecules might not be synthesized easily. Results: In light of these developments, we propose MOZAIC, a novel framework that explores chemical space using reaction-based fragment growing and Conformational Space Annealing, a powerful global optimization algorithm. Our results show that MOZAIC effectively produces chemically diverse molecules with balanced improvements in lead-like properties, including QED, synthetic accessibility, and binding affinity. Furthermore, its flexible objective function allows fine-tuning for specific design goals, such as enhancing solubility with binding affinity. These capabilities position MOZAIC as a valuable platform for advancing fragment-to-lead and lead optimization efforts in drug discovery. Availability and implementation: MOZAIC is available at https://github.com/kucm-lsbi/MOZAIC/. Supplementary Information: Supplementary data are available at Bioinformatics online.

8
A Non-Alchemical Absolute Binding Free Energy Framework for Small Molecule Drugs

Shi, Y.; Li, J.

2026-02-19 biophysics 10.64898/2026.02.18.706686 medRxiv
Top 0.1%
51.1%

A fully physical, non-alchemical framework is presented for absolute binding free energy (ABFE) calculations for protein-ligand complexes. Incorporation of a regularization potential eliminates unphysical artifacts and provides several key advantages: no endpoint catastrophes, fast convergence, robust performance for charged and neutral ligands, and rapid upfront verification. Validated on 30 diverse systems, the method improves predictive accuracy by 15.6% and numerical stability by 17.1% over leading alchemical approaches. This non-alchemical ABFE framework is a potentially accurate and robust tool for future computer-aided small-molecule drug design.

9
Macro-Equi-Diff (MED): Scaffold-based Macrocycles Generation Using Equivariant Diffusion

Kambampati, S. S.; Anumandla, S.; Guttula, S. L.; Kavadi, V. R.; Gogte, S.; Kondaparthi, V.

2026-02-06 bioinformatics 10.64898/2026.02.05.703948 medRxiv
Top 0.1%
44.8%

Macrocyclic compounds are essential in drug discovery as they can modulate protein-protein interactions and enhance selectivity. Their structural complexity enables access to molecular diversity beyond traditional small molecules; however, designing feasible macrocycles remains a challenging task. Current computational methods often fail to generate macrocycles with proper drug-like properties. Here, we present Macro-Equi-Diff (MED), a deep learning framework that combines transformer-based site identification with an E(3)-equivariant Diffusion Model (EDM) for linker creation, and a fragment-linker attachment module. MED transforms acyclic molecules into structurally consistent macrocycles. MED was tested on the ZINC dataset, achieving high validity (93.92%), uniqueness (99.94%), macrocyclization (99.92%), and linker novelty (82.81%). MED improves upon previous methods that lack a macrocyclic geometry context. As a case study, MED was used to macrocyclize four acyclic drugs targeting the JAK2 protein. The generated macrocycles exhibited favourable molecular descriptors and strong binding affinities, establishing MED as a reliable method for expanding the macrocyclic chemical space.

10
AI-Based Methods for Cryptic Pocket Detection Are Fast and Qualitative Compared to Quantitatively Predictive Simulations

Zhang, S.; Miller, J. J.; Bowman, G. R.

2026-01-23 biophysics 10.64898/2026.01.21.700870 medRxiv
Top 0.1%
43.5%

Artificial intelligence (AI) models have advanced rapidly, driving breakthroughs in protein structure prediction, functional annotation, and conformational exploration. Among these, molecular dynamics (MD)-inspired generative models such as AlphaFlow and BioEmu show strong potential for capturing conformational ensembles. In this study, we benchmark these models alongside physics-based MD simulations to evaluate their ability to detect cryptic pockets in proteins. Identifying such transient pockets remains a vital goal in drug discovery, as they can offer new avenues for targeting proteins traditionally challenging to modulate. We also assess two specialized residue-level predictors, PocketMiner and CryptoBank. Using the interferon inhibitory domain of Zaire Ebola VP35 (VP35), TEM-1 β-lactamase with the M182T substitution (TEM β-lactamase), and their mutants, we test whether each method can detect pockets and capture the effects of point mutations known to enhance or suppress pocket formation. All methods successfully identify pockets in VP35 and distinguish between opening and closing mutants. However, in TEM, where pocket opening is subtle, the methods perform inconsistently. These results highlight the promise of AI-based and simulation-based strategies in cryptic pocket discovery while pointing to the need for further improvements to achieve robust, system-independent predictions.

11
LigandForge: A Web Server for Structure-Guided De Novo Drug Design

Nada, H.; Sipos-Szabo, L.; Bajusz, D.; Keseru, G.; Gabr, M.

2026-04-03 bioinformatics 10.64898/2026.03.31.715741 medRxiv
Top 0.1%
43.3%

Despite advances in computational drug discovery, de novo drug design remains hindered by high licensing costs and the need for specialized programming expertise. We present LigandForge, a web server for structure-guided de novo ligand generation. LigandForge integrates structural validation and binding-site characterization; voxel-based property grid construction for spatial mapping of electrostatics and hydrophobicity; chemistry-aware fragment assembly; multi-objective lead optimization; and retrosynthetic feasibility analysis. The platform utilizes a structure-guided framework to assemble molecules from curated fragment libraries while enforcing physicochemical constraints, including molecular weight, LogP, and hybridization states. Generated molecules are refined via reinforcement learning and genetic algorithms and subsequently evaluated using composite metrics such as the quantitative estimate of drug-likeness. By leveraging RDKit for cheminformatics and NGL viewer for real-time 3D visualization, LigandForge provides a synthesis-aware environment that bridges the gap between macromolecular structural data and experimentally feasible lead compounds without requiring local software installation.

12
Transcriptome-based lead generation, ligand- and structure-based prioritization and experimental validation of TLR5-activating molecules

Jain, A.; Hungharla, H.; Subbarao, N.; Tandon, V.; Ahmad, S.

2026-02-26 bioinformatics 10.64898/2026.02.25.707690 medRxiv
Top 0.1%
42.1%

Current in silico drug discovery protocols ubiquitously depend on lead generation using a ligand-based approach in which novel leads are generated by fragment-signature matching or by a structure-based search involving molecular docking and conformational dynamics. None of them incorporates the cellular contexts in which these drugs ultimately operate, leaving the task to a later stage of optimization and leading to a high failure rate. Incorporating systems-level responses of drugs in an early stage of lead generation can significantly address this concern but has not been sufficiently explored. In this work, we employ a systems-level approach using the connectivity map (CMAP) library to generate leads against a challenging system of a TLR pathway. Starting with gene expression data of TLR5 activation by its natural ligand, we generated molecular leads using CMAP and rigorously analyzed their validity using ligand- and structure-based approaches, which helped to prioritize top hits. Experimental validation using an ELISA-based antibody assay confirmed the activation of TLR5 by each of the top nine prioritized leads, with their dose-dependent patterns suggesting that some of them may actually interact with the TLR signaling pathway in a complex manner. Although demonstrated on TLR5, the proposed framework is intuitively scalable to other lead generation and optimization tasks.

13
Cyclic peptides space: The methodology of sequence selection to cover the comprehensive physical properties

Tsuchihashi, R.; Kinoshita, M.

2026-03-12 bioinformatics 10.64898/2026.03.10.710724 medRxiv
Top 0.1%
42.1%

Cyclic peptides have emerged as a pivotal modality for next-generation therapeutics, due to their superior biocompatibility, high selectivity, and structural stability. While AI-driven peptide design has advanced rapidly, conventional optimization algorithms are often constrained by initialization biases, which impede the efficient exploration of the vast chemical space. Here, we propose a novel methodology that integrates the protein language model ESM-2 with cyclic permutation averaging of embeddings to resolve this bottleneck. This approach establishes a comprehensive "peptide space", a high-dimensional vector representation that encapsulates the physicochemical and structural attributes of cyclic peptides. Our analysis reveals that random sequence selection results in a heterogeneous distribution within this space, potentially underrepresenting specific functional regions. Conversely, navigating this defined peptide space enables the selection of libraries that uniformly span diverse molecular properties. In a proof-of-concept study designing binders for β2-microglobulin (β2m), we demonstrate that initial sequences uniformly sampled from our peptide space yield superior candidates more efficiently than those derived from random selection. Furthermore, this framework facilitates the quantitative assessment of mutational perturbations on global peptide properties, supporting rational decision-making for both broad exploration and local optimization. This "peptide space" concept provides a foundational framework for defining appropriate search boundaries and enhancing computational efficiency in AI-mediated drug discovery.
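
One plausible reading of "cyclic permutation averaging of embeddings" is that a sequence embedding is averaged over every rotation of the cyclic peptide, so that the result does not depend on where the ring is "cut" when written linearly. The sketch below illustrates that reading under stated assumptions; it is not the authors' implementation. `toy_embed` is a hypothetical, deliberately position-dependent stand-in for a language-model embedding and has nothing to do with ESM-2.

```python
from typing import Callable, List


def toy_embed(seq: str) -> List[int]:
    """Hypothetical stand-in for a sequence embedding (NOT ESM-2).

    Each residue is weighted by its position, so rotating the sequence
    changes the output, as a linear-sequence model's embedding would.
    """
    return [sum((i + 1) * (ord(aa) % m) for i, aa in enumerate(seq))
            for m in (5, 7, 11)]


def cyclic_average(seq: str,
                   embed: Callable[[str], List[int]] = toy_embed) -> List[float]:
    """Average the embedding over all cyclic rotations of `seq`."""
    n = len(seq)
    if n == 0:
        raise ValueError("sequence must be non-empty")
    rotations = [seq[k:] + seq[:k] for k in range(n)]
    vectors = [embed(r) for r in rotations]
    dim = len(vectors[0])
    return [sum(v[d] for v in vectors) / n for d in range(dim)]


# "ACDEF" and "DEFAC" describe the same ring read from different start points.
linear_a, linear_b = toy_embed("ACDEF"), toy_embed("DEFAC")
cyclic_a, cyclic_b = cyclic_average("ACDEF"), cyclic_average("DEFAC")
```

Because both spellings generate the same set of rotations, the averaged vectors coincide even though the plain embeddings differ. The cost is one embedding call per residue, which is modest for short peptides.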

14
A Systematic Benchmark for Peptide Property Prediction

Dong, X.; Yang, K.; Wu, T.; Li, P.; Gao, L.

2026-02-10 bioinformatics 10.64898/2026.02.09.704773 medRxiv
Top 0.1%
42.0%

Accurate prediction of peptide physicochemical properties and biological activities is critical for rational peptide design and high-throughput screening. However, current research is often constrained by heterogeneous data sources and inconsistent evaluation standards, which hinder fair comparisons and reliable assessments of model generalization. In this work, we present PPB, a peptide property prediction benchmark designed to evaluate model performance with an emphasis on realistic generalization across both classification and regression tasks. By applying unified biological filtering criteria, we systematically curated and standardized 15 datasets comprising 161,571 unique sequences, spanning a wide range of physicochemical properties and functional activities. We benchmarked seven representative architectures--encompassing traditional machine learning, deep learning, and pre-trained language models--alongside diverse feature encoding schemes. Furthermore, we investigated the impact of random versus homology-based (sequence similarity) data splitting strategies on model robustness. To facilitate community access, we developed the PPB web server (http://ppb.molmatrix.com/index.html), which provides centralized resources for standardized dataset downloads, interactive visualization of benchmark results, and detailed evaluation protocols. Author summary: Peptides are short amino acid chains essential for biological functions and drug discovery. While AI models have accelerated peptide property prediction, the field lacks a unified standard to fairly compare these methods, often leading to inconsistent results and overoptimistic performance estimates. In this study, we introduce the Peptide Property Benchmark (PPB), a comprehensive framework featuring 15 standardized datasets and over 160,000 sequences. We systematically evaluated diverse AI paradigms, including traditional machine learning and advanced protein language models.
Our results demonstrate that large-scale pre-trained models--the biological equivalent of large language models--offer superior accuracy and stability, particularly for small or complex datasets. Crucially, our analysis reveals a "clustering bottleneck": standard tools used to group proteins based on similarity often fail when applied to short peptides, causing data to fragment excessively. This suggests that traditional strategies for testing model generalization may be less effective for peptides than previously assumed. To facilitate community progress, we provide an online platform for standardized data and evaluation. This work establishes a rigorous foundation for developing more reliable AI tools for the next generation of peptide-based therapeutics.

15
The selectivity implications of docking libraries with greater and lesser similarities to bio-like molecules

Hall, B. W.; Sakamoto, K.; Huang, X.-P.; Irwin, J. J.; Shoichet, B. K.; Roth, B. L.

2026-02-04 biophysics 10.64898/2026.02.02.703317 medRxiv
Top 0.1%
41.0%

As virtual libraries have expanded into the tens of billions via make-on-demand chemistry, their similarity to metabolites, natural products, and drugs ("bio-like" molecules) has rapidly diminished. Despite this divergence, molecular docking of these ultra-large libraries has yielded molecules at higher experimental hit-rates and with improved affinities. The structural divergence from bio-like space raises the possibility that molecules from these ultra-large libraries have improved selectivity. Just as plausibly, if hit-rates on-target are divorced from similarity to bio-like molecules, so too may be selectivity against off-targets. Here, we test whether docking hits for the 5-HT2A serotonin receptor from ultra-large libraries are more selective than those from smaller and more bio-like "in-stock" libraries. Chemoinformatic similarity predicts that docking actives from the in-stock library have more off-targets than the more chemically novel hits emerging from docking the ultra-large library. This may reflect the bias of the known, however, as when tested experimentally at scale against 318 GPCRs, both 16 agonists from the ultra-large library and 20 actives from the in-stock library had similar numbers of off-targets. While the ultra-large library hits are more sub-type selective for the 5-HT2A over the 5-HT2B and 5-HT2C receptors, overall these results may suggest that selectivity against off-targets, like affinity and hit-rates for on-targets, is divorced from library similarity to bio-like molecules.

16
DESPOT: Direction-Enhanced Scoring POTentials

Poelmans, R.; Bruncsics, B.; Arany, A.; Van Eynde, W.; Shemy, A.; Moreau, Y.; Voet, A. R.

2026-04-02 bioinformatics 10.64898/2026.03.31.714140 medRxiv
Top 0.1%
38.8%

Knowledge-based potentials (KBPs) have long been used to score protein-ligand interactions, yet existing formulations remain isotropic, capturing only distance dependencies and neglecting the directional preferences that govern molecular recognition. Here, we introduce Direction-Enhanced Scoring POTentials (DESPOT), an anisotropic knowledge-based framework that unifies pose scoring and binding-site characterisation within a single probabilistic model. The new probabilistic formulation used in DESPOT naturally supports directional modelling through atom type-specific local reference frames and symmetry-aware geometric discretisation. It also supports steric exclusion, encoded as a dedicated void state that explicitly captures the probability that a spatial bin remains unoccupied. The anisotropic interaction profiles learned by DESPOT reveal systematic directional preferences for interactions such as hydrogen bonds, aromatic interactions, and halogen bonds, that extend beyond idealised geometric models. Evaluation on the CASF-2016 benchmark shows that DESPOT substantially outperforms isotropic KBPs in all pose-discrimination and virtual screening tasks (p << 0.0001 for all enrichment factors), with the largest gains arising from its ability to penalise geometrically implausible poses. Constrained energy minimisation of training structures proves strongly beneficial for the derivation of KBPs, while our train-test leakage analysis reveals that overfitting is an underestimated and understudied issue for KBPs. DESPOT provides a data-driven framework for direction-aware modelling of protein-ligand interactions, with applications in pose scoring, binding-site characterisation, and structure-based design.

17
A Quantum Lens on Molecular Design: A Machine-Learned Energy Function from Interacting Quantum Atoms.

Hoffmann, M.; Kazimir, A.; Oesterreich, T.; Kaermer, L.; Engelberger, F.; Meiler, J.; Lamers, C.

2026-03-05 bioinformatics 10.64898/2026.03.03.709242 medRxiv
Top 0.1%
38.3%

Accurate predictions of the interactions (covalent bonds and non-covalent contacts between atoms) in a molecular system require scalable, accurate, and interpretable energy functions. While classical force fields and knowledge-based energy functions struggle to capture key electronic effects, quantum chemistry approaches such as density functional theory (DFT) provide the necessary accuracy but remain computationally demanding. Furthermore, gaining insight into interactions requires energy decomposition schemes. The Interacting Quantum Atoms (IQA) scheme is exceptionally attractive, offering a chemically intuitive separation into intra- and interatomic contributions based on the topology of the electron density (ED); however, its high computational cost remains a significant barrier for application to larger systems or tasks like ligand screening in drug discovery. We address these limitations by introducing a novel machine learning (ML) framework to predict accurate energies derived from the IQA scheme, together with a comprehensive dataset of molecular systems and their calculated IQA-decomposed energies. It enables the rapid and accurate prediction of DFT single-point energies and dissects these energies in a physically meaningful and chemically intuitive manner. Our method predicts all intra-atomic energies and inter-atomic interaction energies (covalent and non-covalent) within a defined distance cutoff, providing an energy function that decomposes the total energy into specific atomic contributions. This advance makes the IQA method viable for analyzing interaction energies in applications previously inaccessible due to computational expense, such as elucidating ligand-binding mechanisms and informing rational drug design.

18
Assessment of Generative De Novo Peptide Design Methods for G Protein-Coupled Receptors

Junker, H.; Schoeder, C. T.

2026-03-02 bioinformatics 10.64898/2026.02.26.708415 medRxiv
Top 0.1%
37.9%

G protein-coupled receptors (GPCRs) play a ubiquitous role in the transduction of extracellular stimuli into intracellular responses and therefore represent a major target for the development of novel peptide-based therapeutics. In fact, approximately 30% of all non-sensory GPCRs are peptide-targeted, representing a blueprint for the design of de novo peptides, both as pharmacological tools and therapeutics. The recent advances in deep learning-based protein structure generation and structure prediction offer a multitude of peptide design strategies for GPCRs, yet confidence metrics rarely correlate with experimental success. In the context of peptides, this problem is exacerbated by the lack of elaborate tertiary structures, raising the question of whether it is due to inadequate sampling or insufficient scoring. In this two-part benchmark, we addressed this question by first simulating the validation process of 124 unique known GPCR-peptide complexes using AlphaFold2 Initial Guess, Boltz-2 and RosettaFold3. We then assessed the peptide sampling capabilities of the respective generative methods BindCraft, BoltzGen and RFdiffusion3. Our results indicate that current design pipelines primarily suffer from significant confidence overestimation for misplaced peptides in the validation phase across all three prediction methods. We further highlight occurrences of significant memorization in both the prediction and the generation of peptides. While all generative methods sample backbone space sufficiently, their simultaneous sequence generation remains subpar and can be partially recovered through the use of ProteinMPNN. Taken together, our benchmark offers guidance for the design of peptides specifically using deep learning-based pipelines.
Author summary
Deep learning-based protein design is revolutionizing computational biology, and development of such tools is progressing rapidly with increasing attention from both academic and non-academic institutions. Their applicability and performance are often assessed from an all-purpose objective, with implicit bias towards larger protein-protein interactions. Due to their size, peptides therefore present an edge case where performance is known to decrease compared to larger, more structured proteins. Here, we present a benchmark specifically for the deep learning-based design of peptides targeting G protein-coupled receptors (GPCRs), a major therapeutic drug target family, assessing the generation of novel GPCR-targeting peptides and the validation of these designs separately. Our results show that generative methods sample potential peptide placements and orientations sufficiently but validation fails to differentiate valid from invalid designs, indicating that the so-called scoring problem remains unsolved. Although focusing on a specific use case, our results are generalizable to the broader field of protein design. Consequently, this benchmark can offer guidance for peptide-specific design applications and can contribute to the development and improvement of new methods.

19
Extending the MAD Toolbox: New Polymer Builder and Enhanced Martini Database

Marin, R.; Hilpert, C.; Grunewald, F.; Valerio, M.; Borges, L.; Janczarski, S.; Rossini, N.; Marrink, S.-J.; Telles de Souza, P.; Launay, G.

2026-01-26 bioinformatics 10.64898/2026.01.23.700524 medRxiv
Top 0.1%
37.4%

The MArtini Database (MAD) web server (https://mad.ens-lyon.fr/) has been updated with new tools, models, and capabilities to support a broader range of molecular systems for the Martini coarse-grained (CG) force field. The most notable addition is the MAD:Polymer Builder, enabling the automated construction of CG polymer models with varying architectures and complexities. The server now incorporates the latest developments in Martini 3, providing enhanced control within the MAD:Molecule Builder over the conversion of all-atom structures into CG representations, including the implementation of GōMartini 3 for protein complexes and water-protein interaction biases. The MAD:Polymer Builder and MAD:Molecule Builder are both adapted to work with intrinsically disordered proteins and domains. Substantial progress has been made in expanding the MAD:Database, making a growing library of Martini-ready compounds readily accessible across the entire MAD ecosystem. These advances position MAD as a comprehensive and evolving platform for the preparation of diverse systems in CG molecular simulations.

20
Discovery of TDP-43 aggregation inhibitors via a hybrid machine learning framework

Kapsiani, S.; Vora, S.; Fernandez-Villegas, A.; Kaminski, C. F.; Läubli, N. F.; Kaminski Schierle, G. S.

2026-02-14 bioinformatics 10.64898/2026.02.12.705375 medRxiv
Top 0.1%
35.0%

TAR DNA-binding protein 43 (TDP-43) aggregation is a hallmark of several neurodegenerative diseases, including amyotrophic lateral sclerosis and frontotemporal dementia. Recent therapeutic efforts have highlighted the potential of small molecules capable of inhibiting TDP-43 aggregation; however, no effective treatments currently exist. Here, we developed a hybrid machine learning approach combining graph neural network (GNN) embeddings with traditional chemical descriptors and biological target annotations. Using XGBoost as the final classifier enabled model interpretability through SHAP analysis, allowing the identification of key chemical features and target annotations associated with TDP-43 anti-aggregation activity. Complementary Monte Carlo Tree Search analysis highlighted specific chemical substructures linked to predicted activity. By screening an external library of 3,853 small molecules, the model identified two compounds not previously evaluated against TDP-43 aggregation, namely berberrubine and PE859. Molecular docking analysis revealed that both compounds interact favourably with the TDP-43 RNA recognition motif (RRM) domain through distinct binding modes. Experimental validation showed that both compounds significantly reduced TDP-43 aggregation in HEK cells. Further testing in Caenorhabditis elegans expressing human TDP-43 demonstrated that PE859 significantly rescued locomotor defects, while berberrubine showed partial improvement. This work establishes a hybrid machine learning approach for accelerating small molecule drug discovery, yielding two promising therapeutic candidates for TDP-43 proteinopathies.